CS4132 Data Analytics

Kabir Jain¶

The evolution of Basketball and the NBA¶

Contents¶

  1. Motivation & Background
  2. Summary of Research Questions and Results
  3. Terminology
  4. Datasets
  5. Methodology
  6. Data Acquisition
  7. Data Cleaning
  8. EDA
  9. Results Findings & Conclusion
  10. Recommendations or Further Works
  11. References

Motivation and Background¶

Basketball is one of the most popular sports in the world. It is a team sport in which two sides compete to shoot a ball through a hoop mounted 3.05 meters (10 feet) above the ground. The sport has had a considerable cultural impact worldwide.

Basketball has long been a major source of entertainment, and the NBA has consistently been its biggest stage, to the point where the league has become the central authority of the sport. I wish to analyze how the sport has changed over the years, specifically within the NBA. It would be interesting to see how viewership, players and tactics have changed over time, and how significant world events such as COVID-19 have affected the sport.

Summary of Research Questions and Results¶

1. How have the players changed over time?¶

  • Have physical attributes like height and weight changed?
    • Players' heights tended to increase during the early years of basketball and then plateaued. Recently, the average height of players has started to decrease. One possible reason for this is the COVID-19 pandemic.
    • The average weight, on the other hand, remained relatively constant throughout the years.
  • Has there been an increase in players from other countries?
    • Yes, the number of players from other countries has increased greatly over time, and players are coming from more and more countries.

2. How has the popularity of the sport changed over time?¶

  • Where is the game most viewed?
    • The sport is most viewed in the US, followed by China. Other countries show some interest in the sport, but most of the interest comes from these two countries.
  • How has the game's popularity changed over time?
    • Over time, the sport has become more and more popular. The COVID-19 pandemic caused a sharp drop in popularity, but interest still seems to have been increasing through the pandemic. Popularity also varies within each year; these variations are due to special events that happen only once a year, such as the NBA All-Star event.

3. How have players' salaries been affected?¶

  • What are the major contributing factors to a player's salary?
    • The position played, and the number of positions played, seem to be major factors in determining a player's pay. Players who can play point guard, power forward and shooting guard seem to be paid the most compared with their teammates.
  • How have salaries changed over the years?
    • Over the years, salaries have become much more varied and spread out. The median salary has also increased over time. Players are now paid better than they used to be, but the variation between teammates can be large.

4. What is the evolution of shooting strategy within the court?¶

  • We see that, because the NBA pushed back the 3-point line to encourage higher-scoring games, more and more players are taking 3-point shots, and fewer are taking 2-point shots.

Terminology¶

This report uses a number of technical terms, so I define most of them here for clarity.

image.png

The five positions are known by unique names: point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C).

In basketball, a free throw is a specific kind of shot that is awarded when a foul is called. When a player commits a shooting foul, the fouled player on the opposing team receives free throws and shoots them from the free throw line.

image.png

A field goal in basketball is a basket made with any shot or tap other than a free throw. A slam dunk is a particular kind of field goal, in which a player jumps toward the basket while holding the ball and forces it down through the hoop while in the air.

A three-pointer is a field goal that scores three points. Every shot a player makes from beyond the three-point arc (in the diagram, behind the pink shaded region) counts as a three-pointer.

image.png

Term -- Meaning¶

  • Age -- Player's age on February 1 of the season
  • Tm -- Team
  • Lg -- League
  • Pos -- Position
  • G -- Games
  • GS -- Games Started
  • MP -- Minutes Played
  • FG -- Field Goals
  • FGA -- Field Goal Attempts
  • FG% -- Field Goal Percentage
  • 3P -- 3-Point Field Goals
  • 3PA -- 3-Point Field Goal Attempts
  • 3P% -- 3-Point Field Goal Percentage
  • 2P -- 2-Point Field Goals
  • 2PA -- 2-point Field Goal Attempts
  • 2P% -- 2-Point Field Goal Percentage
  • eFG% -- Effective Field Goal Percentage. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
  • FT -- Free Throws
  • FTA -- Free Throw Attempts
  • FT% -- Free Throw Percentage
  • ORB -- Offensive Rebounds
  • DRB -- Defensive Rebounds
  • TRB -- Total Rebounds
  • AST -- Assists
  • STL -- Steals
  • BLK -- Blocks
  • TOV -- Turnovers
  • PF -- Personal Fouls
  • PTS -- Points
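
The eFG% entry in the list above can be made concrete with a short sketch. The formula used here, (FG + 0.5 × 3P) / FGA, is the standard definition of effective field goal percentage; the function name effective_fg_pct and the sample stat line are illustrative assumptions and are not taken from the datasets.

```python
def effective_fg_pct(fg: float, three_p: float, fga: float) -> float:
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    if fga == 0:
        return float("nan")  # no attempts, so the percentage is undefined
    return (fg + 0.5 * three_p) / fga


# Hypothetical stat line: 8 made field goals, 2 of them three-pointers, on 16 attempts.
print(effective_fg_pct(fg=8, three_p=2, fga=16))  # 0.5625
```

A plain FG% for the same line would be 8 / 16 = 0.5, so the two made three-pointers lift the effective figure by 1 / 16.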

Datasets¶

  1. https://www.basketball-reference.com/players/ Information about each of the players, their teams and their salaries.
  2. https://countrycode.org/ Alpha-2 codes for each country.
  3. https://www.bls.gov/respondents/mwr/electronic-data-interchange/appendix-d-usps-state-abbreviations-and-fips-codes.htm Postal abbreviations and FIPS codes for each U.S. state.
  4. multiTimeline.csv Interest in basketball over time (2004-2022), obtained from https://trends.google.com/trends/
  5. geoMap (1).csv Interest in basketball by country, obtained from https://trends.google.com/trends/
  6. https://www.iban.com/country-codes Mapping from alpha-2 codes to alpha-3 codes.
  7. games.csv All games from the 2004 season to 2022, with the teams, date and game details such as the number of points.
  8. games_details.csv Per-game details for the games dataset: all player statistics for a given game.

Methodology¶

In [1]:
!pip install html5lib
!pip install gtab
!pip install plotly
!pip install chart_studio
!pip install mlxtend
In [2]:
import pandas as pd
import unidecode
from tqdm.notebook import tqdm
import plotly.graph_objects as go
import numpy as np
from bs4 import BeautifulSoup, Comment
import re
import chart_studio.plotly as py
import plotly.offline as po
import plotly.graph_objs as pg
import matplotlib.pyplot as plt
import requests
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from scipy.optimize import curve_fit
from sklearn.linear_model import LogisticRegression
from plotly.colors import n_colors
from statsmodels.graphics import tsaplots
from importlib import reload
from mpl_toolkits.mplot3d import Axes3D

tqdm.pandas()

Data Acquisition¶

We first load the player details from the Basketball Reference website (dataset 1). This gives us each player's name, height, weight, date of birth, colleges, etc.

In [3]:
Players = pd.read_csv("Players.csv")
Players
Out[3]:
Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges
0 0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke
1 1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State
2 2 Kareem Abdul-Jabbar* 1970 1989 C 7-2 225.0 April 16, 1947 UCLA
3 3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 LSU
4 4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 Michigan, San Jose State
... ... ... ... ... ... ... ... ... ...
5018 5018 Ante Žižić 2018 2020 F-C 6-10 266.0 January 4, 1997 NaN
5019 5019 Jim Zoet 1983 1983 C 7-1 240.0 December 20, 1953 Kent State University
5020 5020 Bill Zopf 1971 1971 G 6-1 170.0 June 7, 1948 Duquesne
5021 5021 Ivica Zubac 2017 2022 C 7-0 240.0 March 18, 1997 NaN
5022 5022 Matt Zunic 1949 1949 G-F 6-3 195.0 December 19, 1919 George Washington

5023 rows × 9 columns

In [4]:
salaryData = pd.read_csv("SalaryTeamData2.csv")
salaryData
Out[4]:
Unnamed: 0 Season Age Tm Lg Pos G GS MP FG ... AST STL BLK TOV PF PTS Team Salary Unnamed: 30 Trp Dbl
0 0 1990-91 22.0 POR NBA PF 5.0 0.0 2.6 0.4 ... 0.0 0.0 0.0 0.0 0.0 0.8 Portland Trail Blazers $395,000 NaN NaN
1 1 1991-92 23.0 POR NBA PF 8.0 0.0 3.1 0.6 ... 0.3 0.0 0.0 0.3 0.5 1.5 Portland Trail Blazers $494,000 NaN NaN
2 2 1992-93 24.0 BOS NBA PF 4.0 4.0 17.0 2.8 ... 0.3 0.0 0.3 2.3 1.8 5.5 Boston Celtics $500,000 NaN NaN
3 0 1984-85 37.0 LAL NBA C 19.0 19.0 32.1 8.8 ... 4.0 1.2 1.9 2.7 3.5 21.9 Los Angeles Lakers $1,530,000 NaN NaN
4 1 1985-86 38.0 LAL NBA C 14.0 14.0 34.9 11.2 ... 3.5 1.1 1.7 3.0 3.9 25.9 Los Angeles Lakers $2,030,000 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10759 3 2018-19 21.0 LAL NBA C 33.0 12.0 15.6 3.4 ... 0.8 0.1 0.8 1.0 2.2 8.5 Los Angeles Clippers $1,544,951 NaN NaN
10760 4 2018-19 21.0 LAC NBA C 26.0 25.0 20.2 3.8 ... 1.5 0.4 0.9 1.4 2.5 9.4 Los Angeles Clippers $1,544,951 NaN NaN
10761 5 2019-20 22.0 LAC NBA C 72.0 70.0 18.4 3.3 ... 1.1 0.2 0.9 0.8 2.3 8.3 Los Angeles Clippers $6,481,482 NaN NaN
10762 6 2020-21 23.0 LAC NBA C 72.0 33.0 22.3 3.6 ... 1.3 0.3 0.9 1.1 2.6 9.0 Los Angeles Clippers $7,000,000 NaN NaN
10763 7 2021-22 24.0 LAC NBA C 76.0 76.0 24.4 4.1 ... 1.6 0.5 1.0 1.5 2.7 10.3 Los Angeles Clippers $7,518,518 NaN NaN

10764 rows × 35 columns

In [5]:
inflation_data = pd.read_csv("inflation_data.csv")
inflation_data
Out[5]:
year amount inflation rate
0 1800 1.00 0.02
1 1801 1.01 0.01
2 1802 0.85 -0.16
3 1803 0.90 0.06
4 1804 0.94 0.04
... ... ... ...
218 2018 19.94 0.02
219 2019 20.29 0.02
220 2020 20.54 0.01
221 2021 21.51 0.05
222 2022 23.51 0.09

223 rows × 3 columns

Later, when we analyze each player and visit their personal pages, the website gives us the full country name (or the state name, for the US). However, we want the alpha-3 codes in order to plot this, so we load the following codes, which were obtained by web scraping.

In [6]:
country_codes_data = pd.read_csv("CountryCodes.csv")
country_codes_data
Out[6]:
Unnamed: 0 COUNTRY COUNTRY CODE ISO CODES
0 0 Afghanistan 93 AF / AFG
1 1 Albania 355 AL / ALB
2 2 Algeria 213 DZ / DZA
3 3 American Samoa 1-684 AS / ASM
4 4 Andorra 376 AD / AND
... ... ... ... ...
235 235 Wallis and Futuna 681 WF / WLF
236 236 Western Sahara 212 EH / ESH
237 237 Yemen 967 YE / YEM
238 238 Zambia 260 ZM / ZMB
239 239 Zimbabwe 263 ZW / ZWE

240 rows × 4 columns

In [7]:
state_codes_data = pd.read_csv("StateCodes.csv")
state_codes_data
Out[7]:
Unnamed: 0 0 1 2 3 4 5
0 0 State Postal Abbr. FIPS Code State Postal Abbr. FIPS Code
1 1 Alabama AL 01 Nebraska NE 31
2 2 Alaska AK 02 Nevada NV 32
3 3 Arizona AZ 04 New Hampshire NH 33
4 4 Arkansas AR 05 New Jersey NJ 34
5 5 California CA 06 New Mexico NM 35
6 6 Colorado CO 08 New York NY 36
7 7 Connecticut CT 09 North Carolina NC 37
8 8 Delaware DE 10 North Dakota ND 38
9 9 District of Columbia DC 11 Ohio OH 39
10 10 Florida FL 12 Oklahoma OK 40
11 11 Georgia GA 13 Oregon OR 41
12 12 Hawaii HI 15 Pennsylvania PA 42
13 13 Idaho ID 16 Puerto Rico PR 72
14 14 Illinois IL 17 Rhode Island RI 44
15 15 Indiana IN 18 South Carolina SC 45
16 16 Iowa IA 19 South Dakota SD 46
17 17 Kansas KS 20 Tennessee TN 47
18 18 Kentucky KY 21 Texas TX 48
19 19 Louisiana LA 22 Utah UT 49
20 20 Maine ME 23 Vermont VT 50
21 21 Maryland MD 24 Virginia VA 51
22 22 Massachusetts MA 25 Virgin Islands VI 78
23 23 Michigan MI 26 Washington WA 53
24 24 Minnesota MN 27 West Virginia WV 54
25 25 Mississippi MS 28 Wisconsin WI 55
26 26 Missouri MO 29 Wyoming WY 56
27 27 Montana MT 30
In [8]:
country_codes_dataa = pd.read_csv("CountryCodesThreeLetter.csv")
country_codes_dataa
Out[8]:
Unnamed: 0 Country Alpha-2 code Alpha-3 code Numeric
0 0 Afghanistan AF AFG 4
1 1 Åland Islands AX ALA 248
2 2 Albania AL ALB 8
3 3 Algeria DZ DZA 12
4 4 American Samoa AS ASM 16
... ... ... ... ... ...
244 244 Wallis and Futuna WF WLF 876
245 245 Western Sahara EH ESH 732
246 246 Yemen YE YEM 887
247 247 Zambia ZM ZMB 894
248 248 Zimbabwe ZW ZWE 716

249 rows × 5 columns

Since NBA viewership figures are not made public, we gauge interest over time indirectly, using search interest from Google Trends.

In [9]:
BasketBallTrends = pd.read_csv("multiTimeline.csv")
BasketBallTrends
Out[9]:
Category: All categories
Month Basketball: (Worldwide)
2004-01 47
2004-02 48
2004-03 67
2004-04 34
... ...
2022-05 32
2022-06 32
2022-07 28
2022-08 28
2022-09 44

226 rows × 1 columns

In [10]:
geo = pd.read_csv("geoMap (1).csv")
geo
Out[10]:
Category: All categories
Country Basketball: (01/01/2004 - 20/09/2022)
Lithuania 100
United States 56
Montenegro 56
Marshall Islands 55
... ...
Tokelau NaN
Tuvalu NaN
US Outlying Islands NaN
Vatican City NaN
Wallis & Futuna NaN

251 rows × 1 columns

In [11]:
# low_memory=False reads the file in one pass, so mixed-type columns are inferred consistently
details = pd.read_csv('games_details.csv', low_memory=False)
details
Out[11]:
GAME_ID TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID PLAYER_NAME NICKNAME START_POSITION COMMENT MIN ... OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS
0 22101005 1610612750 MIN Minnesota 1630162 Anthony Edwards Anthony F NaN 36:22 ... 0.0 8.0 8.0 5.0 3.0 1.0 1.0 1.0 15.0 5.0
1 22101005 1610612750 MIN Minnesota 1630183 Jaden McDaniels Jaden F NaN 23:54 ... 2.0 4.0 6.0 0.0 0.0 2.0 2.0 6.0 14.0 10.0
2 22101005 1610612750 MIN Minnesota 1626157 Karl-Anthony Towns Karl-Anthony C NaN 25:17 ... 1.0 9.0 10.0 0.0 0.0 0.0 3.0 4.0 15.0 14.0
3 22101005 1610612750 MIN Minnesota 1627736 Malik Beasley Malik G NaN 30:52 ... 0.0 3.0 3.0 1.0 1.0 0.0 1.0 4.0 12.0 20.0
4 22101005 1610612750 MIN Minnesota 1626156 D'Angelo Russell D'Angelo G NaN 33:46 ... 0.0 6.0 6.0 9.0 1.0 0.0 5.0 0.0 14.0 17.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
645948 11200005 1610612743 DEN Denver 202706 Jordan Hamilton NaN NaN NaN 19 ... 0.0 2.0 2.0 0.0 2.0 0.0 1.0 3.0 17.0 NaN
645949 11200005 1610612743 DEN Denver 202702 Kenneth Faried NaN NaN NaN 23 ... 1.0 0.0 1.0 1.0 1.0 0.0 3.0 3.0 18.0 NaN
645950 11200005 1610612743 DEN Denver 201585 Kosta Koufos NaN NaN NaN 15 ... 3.0 5.0 8.0 0.0 1.0 0.0 0.0 3.0 6.0 NaN
645951 11200005 1610612743 DEN Denver 202389 Timofey Mozgov NaN NaN NaN 19 ... 1.0 2.0 3.0 1.0 0.0 0.0 4.0 2.0 2.0 NaN
645952 11200005 1610612743 DEN Denver 201951 Ty Lawson NaN NaN NaN 27 ... 0.0 2.0 2.0 6.0 2.0 0.0 6.0 1.0 8.0 NaN

645953 rows × 29 columns

In [12]:
games = pd.read_csv('games.csv')[["GAME_ID","SEASON"]]
games
Out[12]:
GAME_ID SEASON
0 22101005 2021
1 22101006 2021
2 22101007 2021
3 22101008 2021
4 22101009 2021
... ... ...
25791 11400007 2014
25792 11400004 2014
25793 11400005 2014
25794 11400002 2014
25795 11400001 2014

25796 rows × 2 columns

Data Cleaning¶

The country code and state code data contain many columns that are not needed, such as the FIPS code and the serial number, so we will remove these.

In [13]:
# Keep only the country name and its alpha-2 code (the part of "ISO CODES" before " / ").
country_code = pd.DataFrame({
    "COUNTRY": country_codes_data["COUNTRY"],
    "ISO CODES": country_codes_data["ISO CODES"].str.split(" / ").str[0],
})
# The US row is kept, because the Google Trends merge later on needs it.
country_code
Out[13]:
COUNTRY ISO CODES
0 Afghanistan AF
1 Albania AL
2 Algeria DZ
3 American Samoa AS
4 Andorra AD
... ... ...
235 Wallis and Futuna WF
236 Western Sahara EH
237 Yemen YE
238 Zambia ZM
239 Zimbabwe ZW

240 rows × 2 columns

In [14]:
state_codes_data = pd.read_csv("StateCodes.csv")
# The source table holds two side-by-side blocks (columns 0-2 and 3-5); stack the right block under the left one.
left = state_codes_data.loc[1:, '0':'2']
right = state_codes_data.loc[1:, '3':'5'].rename(columns={'3':'0','4':'1','5':'2'})
state_codes_dat = pd.concat([left, right], ignore_index=True)
# Keep the 53 real rows (the last row is empty padding) and only the name and postal abbreviation.
state_code = state_codes_dat.loc[:52, ['0','1']].copy()
state_code.columns = ["Country","Alpha code"]
state_code["AlphaThree Code"] = "USA"
state_code
Out[14]:
Country Alpha code AlphaThree Code
0 Alabama AL USA
1 Alaska AK USA
2 Arizona AZ USA
3 Arkansas AR USA
4 California CA USA
5 Colorado CO USA
6 Connecticut CT USA
7 Delaware DE USA
8 District of Columbia DC USA
9 Florida FL USA
10 Georgia GA USA
11 Hawaii HI USA
12 Idaho ID USA
13 Illinois IL USA
14 Indiana IN USA
15 Iowa IA USA
16 Kansas KS USA
17 Kentucky KY USA
18 Louisiana LA USA
19 Maine ME USA
20 Maryland MD USA
21 Massachusetts MA USA
22 Michigan MI USA
23 Minnesota MN USA
24 Mississippi MS USA
25 Missouri MO USA
26 Montana MT USA
27 Nebraska NE USA
28 Nevada NV USA
29 New Hampshire NH USA
30 New Jersey NJ USA
31 New Mexico NM USA
32 New York NY USA
33 North Carolina NC USA
34 North Dakota ND USA
35 Ohio OH USA
36 Oklahoma OK USA
37 Oregon OR USA
38 Pennsylvania PA USA
39 Puerto Rico PR USA
40 Rhode Island RI USA
41 South Carolina SC USA
42 South Dakota SD USA
43 Tennessee TN USA
44 Texas TX USA
45 Utah UT USA
46 Vermont VT USA
47 Virginia VA USA
48 Virgin Islands VI USA
49 Washington WA USA
50 West Virginia WV USA
51 Wisconsin WI USA
52 Wyoming WY USA
In [15]:
country_code_alpha = country_codes_dataa[["Alpha-2 code","Alpha-3 code"]].copy()
country_code_alpha
Out[15]:
Alpha-2 code Alpha-3 code
0 AF AFG
1 AX ALA
2 AL ALB
3 DZ DZA
4 AS ASM
... ... ...
244 WF WLF
245 EH ESH
246 YE YEM
247 ZM ZMB
248 ZW ZWE

249 rows × 2 columns

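The names obtained from the website contain accented letters, which make them harder to work with down the line. Hence we apply the following function, which transliterates each name to plain ASCII. Note that apply returns a new dataframe; Players itself is left unchanged unless the result is assigned back.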

In [16]:
def personalInfo(row):
    row.Player = unidecode.unidecode(row.Player)
    return row
In [17]:
Players.apply(personalInfo,axis="columns")
Out[17]:
Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges
0 0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke
1 1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State
2 2 Kareem Abdul-Jabbar* 1970 1989 C 7-2 225.0 April 16, 1947 UCLA
3 3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 LSU
4 4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 Michigan, San Jose State
... ... ... ... ... ... ... ... ... ...
5018 5018 Ante Zizic 2018 2020 F-C 6-10 266.0 January 4, 1997 NaN
5019 5019 Jim Zoet 1983 1983 C 7-1 240.0 December 20, 1953 Kent State University
5020 5020 Bill Zopf 1971 1971 G 6-1 170.0 June 7, 1948 Duquesne
5021 5021 Ivica Zubac 2017 2022 C 7-0 240.0 March 18, 1997 NaN
5022 5022 Matt Zunic 1949 1949 G-F 6-3 195.0 December 19, 1919 George Washington

5023 rows × 9 columns
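
The transliteration step above can also be sketched with the standard library alone. This is a swapped-in technique, not the unidecode call used in the notebook: it decomposes each character with unicodedata and drops the combining accent marks. The function name strip_accents is a hypothetical helper. Unlike unidecode, this approach leaves letters that have no decomposition (for example "ø") untouched.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Ante Žižić"))  # Ante Zizic
```

If the same function is applied both to the player table and to the scraped birthplace tables, names are compared in the same form on both sides of a match.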

To obtain the country or U.S. state of each individual player, we have to scrape each birthplace page on the website and match it to the player.

In [18]:
currentcode=""
def labelUsers(row):
    global currentcode
    global Players
    Players.loc[Players["Player"]==row["Player"], "Country"] = currentcode
In [19]:
def findPlayersnotUS(row):
    global currentcode
    # US-born players are labelled by state in findPlayersUS, so skip the US here.
    if row["ISO CODES"] == "US":
        return
    try:
        web = "https://www.basketball-reference.com/friv/birthplaces.fcgi?country="+row["ISO CODES"]+"&state="
        url = requests.get(web)
        data = pd.read_html(url.text)
        currentcode = row["ISO CODES"]
        # The last table on the page lists the players; flatten its two-level header.
        data[-1].columns = data[-1].columns.droplevel(0)
        data[-1].insert(2,"Country",[row["ISO CODES"] for i in range(len(data[-1]))])
        data[-1].apply(labelUsers,axis="columns")
    except (ValueError, TypeError):
        # No table on the page (no players born there) or a missing country code.
        pass
In [20]:
def findPlayersUS(row):
    global currentcode
    #print(row["Alpha code"])
    web = "https://www.basketball-reference.com/friv/birthplaces.fcgi?country=US&state="+row["Alpha code"]
    url = requests.get(web)
    try:
        
        data = pd.read_html(url.text)
        currentcode = row["Alpha code"]
        data[-1].columns = data[-1].columns.droplevel(0)
        data[-1].insert(2,"Country",[row["Alpha code"] for i in range(len(data[-1]))])
        data[-1].apply(labelUsers,axis="columns")
    except ValueError:
        pass
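
Both scraping functions above build the birthplace URL by string concatenation. As a small sketch of an alternative, the query string can be assembled with urllib.parse.urlencode, which also escapes any unusual characters. The URL pattern is the one used in the cells above; the helper name birthplace_url is a hypothetical addition.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.basketball-reference.com/friv/birthplaces.fcgi"


def birthplace_url(country: str, state: str = "") -> str:
    """Build the birthplace page URL for a country code and optional US state."""
    return f"{BASE_URL}?{urlencode({'country': country, 'state': state})}"


print(birthplace_url("US", "NY"))
# https://www.basketball-reference.com/friv/birthplaces.fcgi?country=US&state=NY
```

Calling birthplace_url("FR") leaves the state parameter empty, which matches the form of the URL requested in findPlayersnotUS.
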
In [21]:
Players.insert(len(Players.columns),"Country",["" for i in range(len(Players.Player))])
Players
Out[21]:
Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges Country
0 0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke
1 1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State
2 2 Kareem Abdul-Jabbar* 1970 1989 C 7-2 225.0 April 16, 1947 UCLA
3 3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 LSU
4 4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 Michigan, San Jose State
... ... ... ... ... ... ... ... ... ... ...
5018 5018 Ante Žižić 2018 2020 F-C 6-10 266.0 January 4, 1997 NaN
5019 5019 Jim Zoet 1983 1983 C 7-1 240.0 December 20, 1953 Kent State University
5020 5020 Bill Zopf 1971 1971 G 6-1 170.0 June 7, 1948 Duquesne
5021 5021 Ivica Zubac 2017 2022 C 7-0 240.0 March 18, 1997 NaN
5022 5022 Matt Zunic 1949 1949 G-F 6-3 195.0 December 19, 1919 George Washington

5023 rows × 10 columns

In [22]:
country_code.progress_apply(findPlayersnotUS,axis="columns")
In [23]:
state_code.progress_apply(findPlayersUS,axis="columns")

Now we remove the duplicate indexes in our Players dataframe. We will also convert the Height data (which is in feet/inches) to centimeters.

In [24]:
Players = Players.loc[~Players.index.duplicated(), :]
Players["HtCm"]=(12*pd.to_numeric(Players["Ht"].str[:1])+pd.to_numeric(Players["Ht"].str[2:]))*2.54
Players
Out[24]:
Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges Country HtCm
0 0 Alaa Abdelnaby 1991 1995 F-C 6-10 240.0 June 24, 1968 Duke EG 208.28
1 1 Zaid Abdul-Aziz 1969 1978 C-F 6-9 235.0 April 7, 1946 Iowa State NY 205.74
2 2 Kareem Abdul-Jabbar* 1970 1989 C 7-2 225.0 April 16, 1947 UCLA NY 218.44
3 3 Mahmoud Abdul-Rauf 1991 2001 G 6-1 162.0 March 9, 1969 LSU MS 185.42
4 4 Tariq Abdul-Wahad 1998 2003 F 6-6 223.0 November 3, 1974 Michigan, San Jose State FR 198.12
... ... ... ... ... ... ... ... ... ... ... ...
5018 5018 Ante Žižić 2018 2020 F-C 6-10 266.0 January 4, 1997 NaN HR 208.28
5019 5019 Jim Zoet 1983 1983 C 7-1 240.0 December 20, 1953 Kent State University CA 215.90
5020 5020 Bill Zopf 1971 1971 G 6-1 170.0 June 7, 1948 Duquesne 185.42
5021 5021 Ivica Zubac 2017 2022 C 7-0 240.0 March 18, 1997 NaN BA 213.36
5022 5022 Matt Zunic 1949 1949 G-F 6-3 195.0 December 19, 1919 George Washington PA 190.50

5023 rows × 11 columns
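
The height conversion above slices the "feet-inches" string by position, which works because every height in the table has a single-digit number of feet. A minimal sketch of the same arithmetic for one value, splitting on the hyphen instead, is shown below; the function name height_to_cm is a hypothetical helper.

```python
def height_to_cm(ht: str) -> float:
    """Convert a 'feet-inches' string such as '6-10' to centimeters."""
    feet, inches = ht.split("-")
    total_inches = 12 * int(feet) + int(inches)
    return round(total_inches * 2.54, 2)  # 1 inch = 2.54 cm


print(height_to_cm("6-10"))  # 208.28
print(height_to_cm("7-2"))   # 218.44
```

These two values agree with the HtCm column shown for Alaa Abdelnaby and Kareem Abdul-Jabbar.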

We can now add alpha-3 country codes to the Players dataframe, so that we can plot the players' birthplaces later on.

In [25]:
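# Look up alpha-3 codes for players labelled with a country (alpha-2) code.
Players = pd.merge(Players.copy(),country_code_alpha,left_on="Country",right_on="Alpha-2 code",how="outer")
# Look up players labelled with a US state abbreviation; these rows get "USA".
Players = pd.merge(Players.copy(),state_code,left_on="Country",right_on="Alpha code",how="outer")
# Combine the two lookups into one column, preferring the state match.
Players["AlphaThree Code"] = Players["AlphaThree Code"].fillna(Players["Alpha-3 code"])
Players["Alpha-3 code"] = Players["AlphaThree Code"]
# The outer merges add rows for codes with no player; drop them.
Players=Players[~Players["Player"].isnull()]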
Players
Out[25]:
Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges Country_x HtCm Alpha-2 code Alpha-3 code Country_y Alpha code AlphaThree Code
0 0.0 Alaa Abdelnaby 1991.0 1995.0 F-C 6-10 240.0 June 24, 1968 Duke EG 208.28 EG EGY NaN NaN EGY
1 3209.0 Abdel Nader 2018.0 2022.0 F 6-5 225.0 September 25, 1993 Northern Illinois, Iowa State EG 195.58 EG EGY NaN NaN EGY
2 1.0 Zaid Abdul-Aziz 1969.0 1978.0 C-F 6-9 235.0 April 7, 1946 Iowa State NY 205.74 NaN USA New York NY USA
3 2.0 Kareem Abdul-Jabbar* 1970.0 1989.0 C 7-2 225.0 April 16, 1947 UCLA NY 218.44 NaN USA New York NY USA
4 12.0 Don Ackerman 1954.0 1954.0 G 6-0 183.0 September 4, 1930 Long Island University NY 182.88 NaN USA New York NY USA
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5018 4009.0 Ha Seung-Jin 2005.0 2006.0 C 7-3 305.0 August 4, 1985 NaN KR 220.98 KR KOR NaN NaN KOR
5019 4338.0 Edy Tavares 2016.0 2017.0 C 7-3 260.0 March 22, 1992 NaN CV 220.98 CV CPV NaN NaN CPV
5020 4375.0 Hasheem Thabeet 2010.0 2014.0 C 7-3 263.0 February 16, 1987 UConn TZ 220.98 TZ TZA NaN NaN TZA
5021 4469.0 Óscar Torres 2002.0 2003.0 F 6-6 210.0 December 18, 1976 NaN VE 198.12 VE VEN NaN NaN VEN
5022 4560.0 Greivis Vásquez 2011.0 2017.0 G 6-6 217.0 January 16, 1987 Maryland VE 198.12 VE VEN NaN NaN VEN

5023 rows × 16 columns

Now, to process the map dataset obtained from Google Trends, we rename the columns, weight the interest by the number of internet users, and convert the country names to alpha-3 codes.

In [26]:
new_header = geo.iloc[0]
geo = geo[1:]
geo.columns = new_header
geo = geo.loc[geo["Basketball: (01/01/2004 - 20/09/2022)"].notnull()]
geo = pd.merge(geo,pd.read_csv("List of Countries by number of Internet Users - Sheet1.csv"),left_index=True,right_on="Country or Area",how="inner")
geo["Basketball: (01/01/2004 - 20/09/2022)"] = pd.to_numeric(geo["Basketball: (01/01/2004 - 20/09/2022)"] )* pd.to_numeric(geo["Internet Users"].str.replace(",",""))
geo = pd.merge(geo,country_code,left_on="Country or Area",right_on="COUNTRY")
geo = pd.merge(geo,country_code_alpha,left_on="ISO CODES",right_on="Alpha-2 code")
geo["Basketball: (01/01/2004 - 20/09/2022)"] = geo["Basketball: (01/01/2004 - 20/09/2022)"]/geo['Basketball: (01/01/2004 - 20/09/2022)'].max()
geo
Out[26]:
Basketball: (01/01/2004 - 20/09/2022) Country or Area Internet Users Population Rank Percentage Rank.1 COUNTRY ISO CODES Alpha-2 code Alpha-3 code
0 0.016413 Lithuania 2,243,448 2,890,297 115 77.62% 58 Lithuania LT LT LTU
1 1.000000 United States 244,090,854 324,459,463 3 75.23% 68 United States US US USA
2 0.001836 Montenegro 448,260 628,960 154 71.27% 75 Montenegro ME ME MNE
3 0.000083 Marshall Islands 20,560 53,127 203 38.70% 138 Marshall Islands MH MH MHL
4 0.029671 Greece 7,799,565 11,159,773 58 69.89% 77 Greece GR GR GRC
... ... ... ... ... ... ... ... ... ... ... ...
179 0.067502 India 461,347,554 1,339,180,127 2 34.45% 145 India IN IN IND
180 0.001831 Sudan 12,512,639 40,533,330 46 30.87% 153 Sudan SD SD SDN
181 0.001104 Yemen 7,548,512 28,250,420 62 26.72% 164 Yemen YE YE YEM
182 0.009048 Pakistan 61,837,331 220,800,300 25 30.68% 184 Pakistan PK PK PAK
183 0.004467 Bangladesh 30,530,435 164,669,751 27 18.02% 180 Bangladesh BD BD BGD

184 rows × 11 columns

We process salaryData and make the salary and season columns numeric.

In [27]:
salaryData["Salary"] = salaryData["Salary"].str.replace(",","").str[1:]
salaryData["Salary"] = pd.to_numeric(salaryData["Salary"], errors="coerce")
salaryData["Season"] = pd.to_numeric(salaryData["Season"].str[:4])
salaryData
Out[27]:
Unnamed: 0 Season Age Tm Lg Pos G GS MP FG ... AST STL BLK TOV PF PTS Team Salary Unnamed: 30 Trp Dbl
0 0 1990 22.0 POR NBA PF 5.0 0.0 2.6 0.4 ... 0.0 0.0 0.0 0.0 0.0 0.8 Portland Trail Blazers 395000.0 NaN NaN
1 1 1991 23.0 POR NBA PF 8.0 0.0 3.1 0.6 ... 0.3 0.0 0.0 0.3 0.5 1.5 Portland Trail Blazers 494000.0 NaN NaN
2 2 1992 24.0 BOS NBA PF 4.0 4.0 17.0 2.8 ... 0.3 0.0 0.3 2.3 1.8 5.5 Boston Celtics 500000.0 NaN NaN
3 0 1984 37.0 LAL NBA C 19.0 19.0 32.1 8.8 ... 4.0 1.2 1.9 2.7 3.5 21.9 Los Angeles Lakers 1530000.0 NaN NaN
4 1 1985 38.0 LAL NBA C 14.0 14.0 34.9 11.2 ... 3.5 1.1 1.7 3.0 3.9 25.9 Los Angeles Lakers 2030000.0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10759 3 2018 21.0 LAL NBA C 33.0 12.0 15.6 3.4 ... 0.8 0.1 0.8 1.0 2.2 8.5 Los Angeles Clippers 1544951.0 NaN NaN
10760 4 2018 21.0 LAC NBA C 26.0 25.0 20.2 3.8 ... 1.5 0.4 0.9 1.4 2.5 9.4 Los Angeles Clippers 1544951.0 NaN NaN
10761 5 2019 22.0 LAC NBA C 72.0 70.0 18.4 3.3 ... 1.1 0.2 0.9 0.8 2.3 8.3 Los Angeles Clippers 6481482.0 NaN NaN
10762 6 2020 23.0 LAC NBA C 72.0 33.0 22.3 3.6 ... 1.3 0.3 0.9 1.1 2.6 9.0 Los Angeles Clippers 7000000.0 NaN NaN
10763 7 2021 24.0 LAC NBA C 76.0 76.0 24.4 4.1 ... 1.6 0.5 1.0 1.5 2.7 10.3 Los Angeles Clippers 7518518.0 NaN NaN

10764 rows × 35 columns

To account for inflation, we take the inflation_data and normalize it so that 2022 has a value of 1.

In [28]:
inflation_data.index = inflation_data["year"]
inflation_data = inflation_data.drop("year",axis=1)
inflation_data = inflation_data.loc[1984:]
inflation_data = inflation_data["amount"]
inflation_data = inflation_data/inflation_data.loc[2022]
inflation_data
Out[28]:
year
1984    0.350915
1985    0.363250
1986    0.370055
1987    0.383667
1988    0.399405
1989    0.418545
1990    0.441089
1991    0.459804
1992    0.473416
1993    0.487877
1994    0.500213
1995    0.514675
1996    0.529562
1997    0.541897
1998    0.550404
1999    0.562314
2000    0.581455
2001    0.598043
2002    0.607401
2003    0.621012
2004    0.637601
2005    0.659294
2006    0.680561
2007    0.700128
2008    0.726925
2009    0.724373
2010    0.736282
2011    0.759251
2012    0.774989
2013    0.786474
2014    0.799234
2015    0.800085
2016    0.810293
2017    0.827308
2018    0.848150
2019    0.863037
2020    0.873671
2021    0.914930
2022    1.000000
Name: amount, dtype: float64
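
One way the normalized index above (2022 = 1) could be applied is to divide each nominal salary by the factor for its season, which expresses it in 2022 dollars. The following is a minimal sketch on made-up rows; the `to_2022_dollars` helper, the `Salary2022` column and the toy salaries are illustrative assumptions and not part of the original analysis.

```python
import pandas as pd

# Toy inflation index normalized so that 2022 == 1.0
# (three values copied from the output above).
inflation_index = pd.Series({1990: 0.441089, 2005: 0.659294, 2022: 1.0})

# Hypothetical salary rows; the numbers are illustrative only.
salaries = pd.DataFrame({
    "Season": [1990, 2005, 2022],
    "Salary": [395000.0, 1000000.0, 7000000.0],
})

def to_2022_dollars(df, index):
    """Divide each nominal salary by the index of its season (hypothetical helper)."""
    factors = df["Season"].map(index)
    return df["Salary"] / factors

salaries["Salary2022"] = to_2022_dollars(salaries, inflation_index)
print(salaries)
```

Because the index is below 1 for every year before 2022, older salaries are scaled up while 2022 salaries are left unchanged.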

We will now process the games data.

In [29]:
# Keep one row per (game, player). The ID columns are still needed for the group-by below, so they are not dropped here.
details.drop_duplicates(subset=["GAME_ID","PLAYER_ID"],keep="first",inplace=True)
In [30]:
details = details.groupby(["GAME_ID","TEAM_ID"]).sum()
details = details.reset_index()
details = details.drop(['PLAYER_ID', 'FG_PCT','FG3_PCT','FT_PCT','PLUS_MINUS'],axis=1)

details["FT_PCT"] = details["FTM"]/details["FTA"]*100
details["FG3_PCT"] = details["FG3M"]/details["FG3A"]*100
details["FG_PCT"] = details["FGM"]/details["FGA"]*100
In [31]:
details = details.sort_values("GAME_ID").reset_index(drop=True)   #Sort the rows by GAME_ID and reset the index, so that rows i and i+1 below are the two teams of the same game
details["VICTORY"] = ""

for i in range(0,len(details)-1,2):
    if details["PTS"][i] < details["PTS"][i+1]:
        
        details.loc[i , "VICTORY"] = "Loss"
        details.loc[i+1 , "VICTORY"] = "Win"  
        
    else:
        details.loc[i , "VICTORY"] = "Win"
        details.loc[i+1 , "VICTORY"] = "Loss" 
In [32]:
details = pd.merge(details,games,how="left",on="GAME_ID")
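
The win/loss labelling above relies on the two teams of each game sitting in consecutive rows. As a hedged alternative, the same labels can be derived without an explicit loop by comparing each team's score with the highest score in its game. This is a minimal sketch on toy data; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy per-team game totals: two rows per game (illustrative values).
toy = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2],
    "TEAM_ID": [10, 20, 10, 30],
    "PTS": [101, 95, 88, 110],
})

# Highest score within each game, broadcast back to every row of that game.
game_max = toy.groupby("GAME_ID")["PTS"].transform("max")

# A team wins if its score equals the game maximum (NBA games cannot end in a tie).
toy["VICTORY"] = (toy["PTS"] == game_max).map({True: "Win", False: "Loss"})
print(toy)
```

This version does not depend on row order, so it keeps working if the frame is sorted differently.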

EDA¶

Question 1: How have the players changed over time?¶

Let's first analyze the physical characteristics of the players over time.

In [33]:
HeightHistogram = sns.histplot(Players["HtCm"])
HeightHistogram.set(xlabel="Height of players (cm)")
Out[33]:
[Text(0.5, 0, 'Height of players (cm)')]

The gaps between the bars arise because heights are recorded in whole inches, so consecutive values are 2.54 cm apart. We observe that the distribution is slightly left skewed.
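
To illustrate where the gaps come from, here is a small sketch that converts "feet-inches" strings to centimetres. The `feet_inches_to_cm` helper is hypothetical and is not necessarily how the HtCm column was built.

```python
def feet_inches_to_cm(height):
    """Convert a 'feet-inches' string such as '6-10' to centimetres (hypothetical helper)."""
    feet, inches = height.split("-")
    return round((int(feet) * 12 + int(inches)) * 2.54, 2)

# Three consecutive whole-inch heights and the step between them.
heights_cm = [feet_inches_to_cm(h) for h in ["6-0", "6-1", "6-2"]]
steps = [round(b - a, 2) for a, b in zip(heights_cm, heights_cm[1:])]
print(heights_cm, steps)
```

Every step between neighbouring heights is one inch, which is why the histogram shows evenly spaced empty bins.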

Now, in the original Players dataframe, we are only given the year of entrance and the year the players left, so to get the data for the players that played at each individual year, we iterate through every "from-to" pair, and get separate rows for each year.
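
The expansion described above can be sketched on toy data as follows. This sketch assumes that both the first and the last year count as active seasons (hence `To + 1`), and it resets the index after exploding so that each player-year row has a unique label; the `careers` and `by_year` names are illustrative.

```python
import pandas as pd

# Toy career table (illustrative values only).
careers = pd.DataFrame({
    "Player": ["A", "B"],
    "From": [2000, 2003],
    "To": [2002, 2003],
})

# One list of seasons per player; `To + 1` makes the last year inclusive
# (an assumption of this sketch).
careers["Years"] = [
    list(range(start, end + 1))
    for start, end in zip(careers["From"], careers["To"])
]

# explode() turns each list element into its own row and repeats the other columns.
by_year = careers.explode("Years").reset_index(drop=True)
by_year["Years"] = pd.to_numeric(by_year["Years"])
print(by_year)
```

Note that `explode` repeats the original index for every generated row, so dropping duplicated index labels afterwards would keep only one season per player.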

In [34]:
PlayersByYears = Players.copy()
decompose = lambda x: [i for i in range(int(x["From"]),int(x["To"]))]
PlayersByYears["Years"] = PlayersByYears.apply(decompose,axis=1)
PlayersByYears = PlayersByYears.explode("Years")
PlayersByYears.sort_values(by="Years",inplace=True)
PlayersByYears=PlayersByYears[PlayersByYears["Years"].notnull()]
PlayersByYears["YearsNum"] = pd.to_numeric(PlayersByYears["Years"])
PlayersByYears["bins"] = pd.cut(PlayersByYears['YearsNum'], bins=np.arange(1945,2026,5),labels = [str(1945+int(i)*5)+"-"+str(1945+(int(i)+1)*5) for i in range(-1+len(np.arange(1945,2026,5)))])
PlayersByYears = PlayersByYears.loc[~PlayersByYears.index.duplicated(), :]
PlayersByYears
Out[34]:
Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges Country_x HtCm Alpha-2 code Alpha-3 code Country_y Alpha code AlphaThree Code Years YearsNum bins
1297 3295.0 Stan Noszka 1947.0 1949.0 G 6-1 185.0 September 9, 1920 Duquesne PA 185.42 PA USA Pennsylvania PA USA 1947 1947 1945-1950
3095 3855.0 Giff Roux 1947.0 1949.0 F-C 6-5 195.0 June 28, 1923 Kansas MO 195.58 MO USA Missouri MO USA 1947 1947 1945-1950
278 3294.0 George Nostrand 1947.0 1950.0 C-F 6-8 195.0 January 25, 1924 Wyoming NY 203.20 NaN USA New York NY USA 1947 1947 1945-1950
2009 3127.0 Elmore Morgenthaler 1947.0 1949.0 C 6-9 230.0 August 3, 1922 New Mexico Tech, Boston College TX 205.74 NaN USA Texas TX USA 1947 1947 1945-1950
2463 2758.0 John Mahnken 1947.0 1953.0 C 6-8 220.0 June 16, 1922 Georgetown NJ 203.20 NaN USA New Jersey NJ USA 1947 1947 1945-1950
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
171 2050.0 Elijah Hughes 2021.0 2022.0 F 6-5 215.0 March 10, 1998 Syracuse, East Carolina University NY 195.58 NaN USA New York NY USA 2021 2021 2020-2025
1978 2315.0 Mason Jones 2021.0 2022.0 G 6-4 200.0 July 21, 1998 Connors State College, Arkansas TX 193.04 NaN USA Texas TX USA 2021 2021 2020-2025
1434 200.0 LaMelo Ball 2021.0 2022.0 G 6-7 180.0 August 22, 2001 NaN CA 200.66 CA USA California CA USA 2021 2021 2020-2025
3421 4146.0 Jalen Smith 2021.0 2022.0 F 6-10 215.0 March 16, 2000 Maryland VA 208.28 VA USA Virginia VA USA 2021 2021 2020-2025
3465 3011.0 Sam Merrill 2021.0 2022.0 G 6-4 205.0 May 15, 1996 Utah State University UT 193.04 NaN USA Utah UT USA 2021 2021 2020-2025

3598 rows × 19 columns

In [35]:
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
g = sns.FacetGrid(PlayersByYears, row="bins", aspect=20, height=0.8)
g.map_dataframe(sns.kdeplot, x="HtCm",fill=True, alpha=1)
g.map_dataframe(sns.kdeplot, x="HtCm", color='black')
g.fig.subplots_adjust(hspace=-0.3)
g.set_titles("")
g.set_ylabels("Density")
g.set(xlabel="Height of players (cm)")

#g.xaxis.get_label()
g.set(yticks=[])
g.despine(left=True)
Out[35]:
<seaborn.axisgrid.FacetGrid at 0x17244b799d0>
In [36]:
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["HtCm"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Height of players (cm)")
#ax.bar(range(len(data)), values,)

ax.set_ylim([185,205])
plt.show()
In [37]:
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
#sns.regplot(x=range(0,len(PlayersByYears.groupby("bins").mean().index)),y=PlayersByYears.groupby("bins").mean()["Wt"])
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["Wt"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Weight of players (lbs)")
plt.show()

This tells us that the range of the players' heights seems to be decreasing over time. Looking at the mean height, it initially increases over time, but then plateaus. Recently, the height has started to decrease (the most recent bin, 2020-2025, covers the Covid-19 period).

The weight of the players, however, remains relatively constant. This suggests that shorter players tend to be more heavily built, while taller players tend to be leaner.
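
The five-year binning and per-bin averaging behind these bar charts can be sketched on toy data as follows. The values are made up; the point is how `pd.cut` assigns years to labelled bins.

```python
import numpy as np
import pandas as pd

# Toy data: season year and height in cm (illustrative values).
toy = pd.DataFrame({
    "YearsNum": [1947, 1950, 1951, 1988, 2021],
    "HtCm": [185.0, 190.0, 192.0, 201.0, 198.0],
})

edges = np.arange(1945, 2026, 5)
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# pd.cut uses right-closed intervals by default, so 1950 falls in "1945-1950".
toy["bins"] = pd.cut(toy["YearsNum"], bins=edges, labels=labels)

# Mean height per bin; observed=True skips bins that contain no rows.
mean_height = toy.groupby("bins", observed=True)["HtCm"].mean()
print(mean_height)
```

Since the intervals are closed on the right, a boundary year such as 1950 is counted in the earlier bin.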

Now we take a look at the countries where these players are from.

In [38]:
fig = go.Figure(data=go.Choropleth(
    locations = Players.groupby("AlphaThree Code").count()["Player"].index,
    z = Players.groupby("AlphaThree Code").count()["Player"],
    #text = geo["Country or Area"],
    colorscale=[[0, 'rgb(0,0,0)'], [1,'rgb(255,0,0)']],
    autocolorscale=True,
    reversescale=True,
    #marker_line_color='viridis',
    marker_line_width=0.5,
    colorbar_title = 'Number of Players',
))
fig.show()

It is evident that most players are from the US. This makes sense, as the NBA was founded there and is most popular there. Interestingly, there are also many players from other countries, so let's take a look at how the players' origins change over time.

In [39]:
sortedGroupeddf = pd.DataFrame(PlayersByYears.groupby(["AlphaThree Code","YearsNum"]).count().reset_index()).sort_values(by="YearsNum")
sortedGroupeddf
Out[39]:
AlphaThree Code YearsNum Unnamed: 0 Player From To Pos Ht Wt Birth Date Colleges Country_x HtCm Alpha-2 code Alpha-3 code Country_y Alpha code Years bins
261 USA 1947 71 71 71 71 71 71 70 71 66 71 71 32 71 71 71 71 71
189 NLD 1947 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1
262 USA 1948 23 23 23 23 23 23 23 23 22 23 23 15 23 23 23 23 23
263 USA 1949 49 49 49 49 49 49 49 49 48 49 49 28 49 49 49 49 49
264 USA 1950 59 59 59 59 59 59 59 59 56 59 59 29 59 59 59 59 59
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
152 JAM 2021 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1
37 BRA 2021 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1
115 GIN 2021 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1
188 NGA 2021 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 2 2
166 LTU 2021 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 1

340 rows × 19 columns

In [40]:
import plotly.express as px
fig = px.choropleth(sortedGroupeddf, 
              locations = 'AlphaThree Code',
              color="Player", 
              animation_frame="YearsNum",
              color_continuous_scale="viridis",
              #locationmode='USA-states',
              #scope="usa",
              range_color=(0, 10),
              height=600             )
fig.layout.coloraxis.colorbar.title = 'Number of Players'
fig.show()
In [41]:
import plotly.express as px
px.choropleth(pd.DataFrame(PlayersByYears[PlayersByYears["AlphaThree Code"]=="USA"].groupby(["Alpha code","YearsNum"]).count().reset_index()).sort_values(by="YearsNum"),
              locations = 'Alpha code',
              color="Player", 
              animation_frame="YearsNum",
              color_continuous_scale="viridis",
              locationmode='USA-states',
              scope="usa",
              range_color=(0, 14),
              height=600
             )
In [42]:
sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].plot(xlabel="Years",ylabel = "Number of countries")
Out[42]:
<AxesSubplot:xlabel='Years', ylabel='Number of countries'>

We see that over time, the number of players from different countries has increased. Looking at the US alone, we see that many NBA players are from either the east or west coast. This could be because basketball is promoted a lot more in these areas, or simply because most people live near the coasts. Now that we have analyzed the players, let's take a look at the viewers watching the NBA.

Question 2: How has the popularity of the sport changed over time?¶

Now, we do not have direct access to NBA viewership across the years, but we can use the Google Trends data for basketball as a proxy. There is one issue: Google Trends reports interest relative to each region's total search volume rather than as an absolute count, so we approximate the absolute level of interest by multiplying the score by the number of internet users in each country.
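
The correction described above can be sketched on toy numbers as follows. The countries, scores and user counts are invented for illustration, and the `RelativeInterest` and `Weighted` column names are assumptions of this sketch.

```python
import pandas as pd

# Toy inputs (illustrative numbers, not real Google Trends values).
trends = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "RelativeInterest": [100, 40, 10],  # 0-100 score reported per region
    "Internet Users": ["1,000,000", "50,000,000", "2,000,000"],
})

# Parse the comma-separated counts into numbers.
users = pd.to_numeric(trends["Internet Users"].str.replace(",", ""))

# Approximate absolute interest: relative score times audience size.
trends["Weighted"] = trends["RelativeInterest"] * users

# Rescale so the largest country has a value of 1.
trends["Weighted"] = trends["Weighted"] / trends["Weighted"].max()
print(trends.sort_values("Weighted", ascending=False))
```

In this toy case country A has the highest relative score, but country B ends up on top once the size of its online population is taken into account.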

In [43]:
fig = go.Figure(data=go.Choropleth(
    locations = geo['Alpha-3 code'],
    z = np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"])/np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"]).max(),
    text = geo["Country or Area"],
    colorscale="pinkyl",
    autocolorscale=True,
    reversescale=True,
    #marker_line_color='viridis',
    marker_line_width=0.5,
    colorbar_title = 'Normalized Viewership on logarithmic scale',
))
fig.show()

We see a lot of viewers from the United States and, perhaps surprisingly, a lot of viewers from China. This makes sense, since the NBA has been putting deliberate attention and work into capturing the Chinese market.

Let's take a look at the interest across time.

In [44]:
overallTime = pd.DataFrame([pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Category: All categories"],pd.to_numeric(pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Unnamed: 1"])]).T
for i in range(0,12*(len(overallTime["Category: All categories"])//12),12):
    yearlyPlot = sns.lineplot(x = range(12),y= overallTime["Unnamed: 1"].iloc[i:i+12],color="blue")
    yearlyPlot.set(xlabel="months",ylabel = "normalized interest")
In [45]:
allWeeks = pd.DataFrame(pd.to_numeric(pd.read_csv("multiTimeline_1.csv").reset_index().loc[1:,"Category: Sports"]))
allWeeks.columns = ["2006"]

for i in range(2,18):
    thisWeeks = pd.DataFrame(pd.to_numeric(pd.read_csv("multiTimeline_"+str(i)+".csv").reset_index().loc[1:,"Category: Sports"]))
    thisWeeks.columns = [str(2006+i-1)]
    allWeeks = pd.concat([allWeeks, thisWeeks], axis=1)
allWeeks
hotmap = sns.heatmap(allWeeks.T)
hotmap.set(xlabel="Weeks",ylabel="Years")
plt.show()

We see that across the years, there is a peak in popularity around the 11-week mark. This roughly coincides with the NBA All-Star events that happen every year around this time. Note that each year is normalized separately, so to get the true picture of popularity across the years, we plot the overall timeline of the Google Trends data.

In [46]:
overallTime = pd.DataFrame([pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Category: All categories"],pd.to_numeric(pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Unnamed: 1"])]).T
fig = plt.gcf()
fig.set_size_inches(18.5, 6.5)
plt.plot(overallTime["Unnamed: 1"].iloc[:-33])
plt.plot(overallTime["Unnamed: 1"].iloc[-34:], color="red")
plt.xticks(range(0,len(overallTime["Category: All categories"]),24),[overallTime["Category: All categories"].iloc[i] for i in range(0,len(overallTime["Category: All categories"]),24)])
ax = plt.gca()
ax.set(xlabel="Date", ylabel="normalized interest")
plt.show()
In [47]:
plt=reload(plt)

fig = tsaplots.plot_acf(overallTime["Unnamed: 1"], lags=40)

plt.xlabel("Lag (months)")
plt.ylabel("Autocorrelation")

plt.show()

We see the spikes of popularity as explained earlier; however, we also see that the popularity of basketball has been increasing over time. With the Covid-19 pandemic (highlighted in red), there is a sudden, sharp decrease in its popularity. Despite the virus, the sport's popularity seems to continue to grow rapidly as time progresses.

From the autocorrelation plot, we can observe that the data is correlated with itself at a lag of 12 months. This confirms that the data is periodic and suggests that the variation is due to seasonal changes across the year.
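
To make the lag-12 reading concrete, here is a minimal sketch on a synthetic monthly series with a yearly cycle. It uses pandas' `Series.autocorr`, which computes a plain Pearson correlation with the shifted series and may differ slightly from the estimator behind `tsaplots.plot_acf`; the series itself is invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a 12-month cycle plus a slow upward trend (illustrative only).
months = np.arange(120)
series = pd.Series(10 * np.sin(2 * np.pi * months / 12) + 0.05 * months)

# Correlation of the series with itself shifted by half a year and by a full year.
acf_6 = series.autocorr(lag=6)
acf_12 = series.autocorr(lag=12)
print(acf_6, acf_12)
```

A yearly cycle shows up as a strong positive correlation at a lag of 12 months and a negative one at 6 months, when the cycle is in the opposite phase.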

Now let's analyze the salaries that these players earn, and what are the factors affecting them.

Question 3: How have players' salaries been affected?¶

In [48]:
sns.set(rc={'figure.figsize':(41.7,8.27)})
sns.boxplot(x = salaryData["Age"],y=salaryData["Salary"])
sns.stripplot(x = salaryData["Age"],y=salaryData["Salary"],jitter = 0.4)
Out[48]:
<AxesSubplot:xlabel='Age', ylabel='Salary'>

As players get older, they get paid more. This makes sense: they gain experience, become fan favourites and get better at the game. However, when players get too old, their pay decreases. This could be because players' performance tends to decline with age. How about the pay for all ages over time?

In [49]:
sns.set(rc={'figure.figsize':(41.7,8.27)})
sns.boxplot(x = salaryData["Season"],y=salaryData["Salary"])
sns.stripplot(x = salaryData["Season"],y=salaryData["Salary"],jitter = 0.4)
Out[49]:
<AxesSubplot:xlabel='Season', ylabel='Salary'>
In [50]:
#sns.regplot()
sns.set(rc={'figure.figsize':(15.7,8.27)})
salyear = sns.regplot(x  = salaryData.groupby("Season").median().index, y = salaryData.groupby("Season").median()["Salary"])
salyear.set(xlabel = "Year")
Out[50]:
[Text(0.5, 0, 'Year')]

We can see that over the years, the salary has generally increased. The range of the salaries also increases, and the maximum salary tends to increase as well. This could be due to the rise in popularity of the sport at large.

In [51]:
sns.set(rc={'figure.figsize':(41.7,8.27)})
sns.stripplot(x = salaryData["Pos"],y=salaryData["Salary"],jitter=0.3,alpha=0.2)
sns.violinplot(x = salaryData["Pos"],y=salaryData["Salary"],inner="quartile")
Out[51]:
<AxesSubplot:xlabel='Pos', ylabel='Salary'>
In [52]:
salaryData.groupby("Pos").median()["Salary"].sort_values().plot(kind="bar",ylabel="Median Salary")
Out[52]:
<AxesSubplot:xlabel='Pos', ylabel='Median Salary'>

The five positions are known by unique names: the point guard (PG), the shooting guard (SG), the small forward (SF), the power forward (PF), and the center (C).

We see that players who are able to play multiple positions, like PG, PF and SG, get paid the most, while players listed at a single position or a narrower combination, like PG alone or PG and SG, are not paid as much. Interestingly, for C, PF there seems to be a bimodal distribution, probably because there are different levels of players within this category. However, this data has a lot of outliers, so for further analysis we remove the top and bottom 1% of salaries.

In [53]:
lst = [ "Salary"]
salaryData_filtered = salaryData.copy()
for i in lst:
    q_low = salaryData_filtered[i].quantile(0.01)
    q_hi  = salaryData_filtered[i].quantile(0.99)
    salaryData_filtered = salaryData_filtered[(salaryData_filtered[i] < q_hi) & (salaryData_filtered[i] > q_low)]
In [54]:
salaryData_filtered = salaryData_filtered.drop(["Unnamed: 0","Unnamed: 30","Trp Dbl"],axis=1)
In [55]:
sns.set(rc={'figure.figsize':(15.7,8.27)})
sns.pairplot(salaryData_filtered[["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%", "Salary"]])
Out[55]:
<seaborn.axisgrid.PairGrid at 0x1725c3baf70>
In [56]:
# Absolute correlation of every column with Salary, sorted ascending; the last two entries of the sorted series are left out
salary_corr = np.abs(salaryData_filtered.corr()["Salary"]).sort_values()[:-2]
plt.barh(np.arange(len(salary_corr)), salary_corr.values)
plt.yticks(np.arange(len(salary_corr)), salary_corr.index)
ax = plt.gca()
ax.set(xlabel="Absolute correlation coefficient with Salary")
plt.show()

We see that there is no strong correlation between the salary and any other factor, such as 3 point attempts, and other performance-related metrics. The strongest metric is the season, which as explained earlier, could be due to the rise of popularity of the sport.

In [57]:
fig = plt.figure(figsize=(20,16))
gs = fig.add_gridspec(5, 8, hspace=0.2, wspace=0)
axes = gs.subplots(sharex=False, sharey=False)

# enumerate starts at 0, so the first panel of the grid is filled as well
for j, i in enumerate(salaryData_filtered["Season"].unique()):
    features = salaryData_filtered[salaryData_filtered["Season"]==i][["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%","Salary"]].copy()
    mlm = LinearRegression()
    features.dropna(axis=0, inplace= True)
    X_train2, X_test2, y_train2, y_test2 = train_test_split(features[["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%"]],features["Salary"], test_size=0.2, random_state=0)
    mlm.fit(X_train2, y_train2)
    
    yhat2 = mlm.predict(X_test2)
    gg = sns.kdeplot(y_test2, color='r', label='Actual Value',  ax = axes[j//8,j%8])
    gg = sns.kdeplot(yhat2,  color='b', label='Fitted Value',  ax = axes[j//8,j%8])
    gg.set(xlabel=None)
    gg.set(ylabel=None)
plt.show()
In [58]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

fig = plt.figure(figsize=(20,16))
gs = fig.add_gridspec(5, 8, hspace=0.2, wspace=0)
axes = gs.subplots(sharex=False, sharey=False)

# enumerate starts at 0, so the first panel of the grid is filled as well
for j, i in enumerate(salaryData_filtered["Season"].unique()):
    features = salaryData_filtered[salaryData_filtered["Season"]==i][["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%","Salary"]].copy()
    mlm = tree.DecisionTreeRegressor(max_depth=4, criterion="squared_error") 
    features.dropna(axis=0, inplace= True)
    X_train2, X_test2, y_train2, y_test2 = train_test_split(features[["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%"]],features["Salary"], test_size=0.2, random_state=0)
    mlm.fit(X_train2, y_train2)
    
    yhat2 = mlm.predict(X_test2)
    gg = sns.kdeplot(y_test2, color='r', label='Actual Value', ax = axes[j//8,j%8])
    gg = sns.kdeplot(yhat2,  color='b', label='Fitted Value', ax = axes[j//8,j%8])
    gg.set(xlabel=None)
    gg.set(ylabel=None)


plt.show()

Neither a multiple linear regression model nor a decision tree regressor seems to accurately predict the salary of NBA players for each year. This is probably due to other factors, like how popular the athletes are and their performance in each individual game.
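
A numeric score can complement the visual comparison of the actual and fitted densities. The sketch below computes the coefficient of determination, R², for an ordinary least-squares fit on held-out data. It uses synthetic data and plain NumPy rather than the notebook's scikit-learn models, so it is an illustration of the metric only and not a re-run of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target depends only weakly on the features (illustrative only).
X = rng.normal(size=(200, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=200)

# Simple 80/20 split without shuffling (the rows are already random).
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_hat = np.column_stack([np.ones(len(X_test)), X_test]) @ coef

# R^2 = 1 - SS_res / SS_tot; values near 0 mean the model explains little variance.
ss_res = np.sum((y_test - y_hat) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

An R² close to 1 would indicate accurate predictions, while a value near or below 0 means the model does little better than predicting the mean salary.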

Plotting the salary distribution over time, we get this:

In [59]:
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
g = sns.FacetGrid(salaryData_filtered, row="Season", aspect=20, height=0.8)
g.map_dataframe(sns.kdeplot, x="Salary",fill=True, alpha=1)
g.map_dataframe(sns.kdeplot, x="Salary", color='black')
g.fig.subplots_adjust(hspace=-.9)
g.set_titles("")
g.set_ylabels("Density")
#g.xaxis.get_label()
g.set(yticks=[])
g.despine(left=True)
Out[59]:
<seaborn.axisgrid.FacetGrid at 0x17262ef3ee0>

Over time, the distribution of the salaries spreads out! This is very interesting, and could be due to the sharp increase in popularity of certain players like Michael Jordan, LeBron James, etc.

This leads teams to pay them more, which increases the range of the distribution.

In [60]:
sns.set(rc={'figure.figsize':(15.7,13)})
sns.heatmap(salaryData_filtered.corr(),vmin=-1, vmax=1, cmap="vlag")
Out[60]:
<AxesSubplot:>

Unsurprisingly, we note that there is a high correlation between the different shooting metrics. There is only a weak correlation between other parameters like "Season", "Salary" and "Age".

Question 4: What is the evolution of shooting strategy within the court?¶

In [61]:
plt.plot(salaryData_filtered.groupby("Season").mean()["3PA"])
ax = plt.gca()
ax.set(xlabel="years",ylabel="Mean 3 point attempts")
Out[61]:
[Text(0.5, 0, 'years'), Text(0, 0.5, 'Mean 3 point attempts')]
In [62]:
plt.plot(salaryData_filtered.groupby("Season").mean()["2PA"])
ax = plt.gca()
ax.set(xlabel="years",ylabel="Mean 2 point attempts")
Out[62]:
[Text(0.5, 0, 'years'), Text(0, 0.5, 'Mean 2 point attempts')]
In [90]:
fig, ax = plt.subplots(1,3,figsize=(15,8),dpi=200)
sns.set_theme(style="white", rc={"axes.facecolor": (1, 1, 1, 1)})

plt.suptitle("Link between the percentage of each type of shot and winning")

sns.scatterplot(ax = ax[0],data=details,x="FG3A",y="FG3_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[0].set(ylabel="Three-point shot percentage", xlabel="Three-point shots attempted")

sns.scatterplot(ax = ax[1],data=details,x="FGA",y="FG_PCT",hue="VICTORY",style="VICTORY",alpha=0.8) 
ax[1].set(ylabel="Field goal percentage", xlabel="Field goals attempted")
 
sns.scatterplot(ax = ax[2],data=details,x="FTA",y="FT_PCT",hue="VICTORY",style="VICTORY",alpha=0.8) 
ax[2].set(ylabel="Free throw percentage", xlabel="Free throws attempted")

plt.show()
In [64]:
fig = plt.figure()
#ax = Axes3D(fig)

ax = fig.add_subplot(111, projection='3d')
my_color = details["VICTORY"].unique()
ax.scatter(details['FG3_PCT'][details["VICTORY"]=="Loss"], details['FG_PCT'][details["VICTORY"]=="Loss"], details['FT_PCT'][details["VICTORY"]=="Loss"], c="red", s=60)
ax.scatter(details['FG3_PCT'][details["VICTORY"]=="Win"], details['FG_PCT'][details["VICTORY"]=="Win"], details['FT_PCT'][details["VICTORY"]=="Win"], c="blue", s=60)
ax.set_xlabel("FG3_PCT")
ax.set_ylabel("FG_PCT")
ax.set_zlabel("FT_PCT")
ax.view_init(45,135)
plt.show()

The NBA changed the distance of the three-point line in the 1990s. The original intention was to create higher-scoring games. We see that, almost immediately after this change, the number of 3-point attempts shot up dramatically. The number of 2-point attempts, on the other hand, started decreasing over the years. Critics say that this change has made the game less aggressive; however, the change still appears to be well received (as suggested by the increased viewership).

Taking a look at the victories, it appears that a high number of 3-point attempts helps teams win.
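
One simple way to check this reading numerically is to compare three-point volume and accuracy between winning and losing teams. The sketch below does this on a toy frame; the values are invented, and only the column names follow the details table.

```python
import pandas as pd

# Toy per-team game rows (illustrative values only).
toy = pd.DataFrame({
    "VICTORY": ["Win", "Loss", "Win", "Loss"],
    "FG3A": [35, 28, 40, 31],
    "FG3M": [14, 9, 15, 10],
})

# Total attempts and makes per outcome, then the pooled percentage.
summary = toy.groupby("VICTORY")[["FG3A", "FG3M"]].sum()
summary["FG3_PCT"] = summary["FG3M"] / summary["FG3A"] * 100

# Average number of three-point attempts per game for each outcome.
summary["FG3A_per_game"] = toy.groupby("VICTORY")["FG3A"].mean()
print(summary)
```

Applied to the real data, a clearly higher average for the "Win" row would support the visual impression from the scatter plots.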

Results Findings & Conclusion¶

1. How have the players changed over time?¶

Have physical attributes like height and weight changed?¶
In [65]:
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["HtCm"],color=["blue"]*15+["red"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Height of players (cm)")
#ax.bar(range(len(data)), values,)

ax.set_ylim([185,205])
plt.show()

Players' heights tend to increase over time from 1945 to 1985. Note that the y axis starts at 185 cm rather than 0. This is to emphasize the differences in height, as even small differences in average height are meaningful. The height plateaus from then onwards up until the 2020 bin, where we suddenly see a sharp decrease in the average height.

This bin covers the period when the pandemic hit, which possibly hindered the outreach to players and shrank the pools of players that recruiters select from.

In [66]:
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
#sns.regplot(x=range(0,len(PlayersByYears.groupby("bins").mean().index)),y=PlayersByYears.groupby("bins").mean()["Wt"])
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["Wt"],color=["blue"]*15+["red"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Weight of players (lbs)")
plt.show()

The story, however, changes when we look at the weight of the players, which remains relatively constant. Throughout the years, regardless of the height of the players or the pandemic, the weight of the players does not seem to fluctuate greatly. This implies that players nowadays are leaner and taller than players used to be.

Has there been an increase of players from other countries?¶
In [67]:
import plotly.express as px
fig = px.choropleth(sortedGroupeddf, 
              locations = 'AlphaThree Code',
              color="Player", 
              animation_frame="YearsNum",
              color_continuous_scale="viridis",
              #locationmode='USA-states',
              #scope="usa",
              range_color=(0, 10),
              height=600             )
fig.layout.coloraxis.colorbar.title = 'Number of Players'
fig.show()

Note: The scale is capped at 10 players to emphasize the differences between countries other than the U.S.

In [68]:
sns.regplot(x=sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].index,y=sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].values, order=2,color="Blue")
plt.plot(sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].index,sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].values)
ax = plt.gca()
ax.annotate('Increase in number of \ncountries players are from', xy=(2015, 14), xytext=(1980, 6),
            arrowprops=dict(facecolor='black',
                            connectionstyle="angle3,angleA=0,angleB=-130"));

Note: this is a quadratic fit, meant only to illustrate that the number of countries is increasing; it is not an actual trend prediction.

We can see an increase in the number of countries that players come from. According to analysts, this probably reflects the NBA's attempts to appeal to foreign audiences and increase worldwide viewership. This strategy worked incredibly well, as shown by the NBA's massive hold on the Chinese market: ever since the addition of Yao Ming, Chinese viewership of the sport has skyrocketed. More details on this are given in the next question.
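
For readers unfamiliar with a quadratic fit, the sketch below fits a degree-2 polynomial with `np.polyfit` to toy counts that follow an exact quadratic, so the recovered coefficients are easy to check. The data are invented, and this is a comparable least-squares fit rather than the exact call used for the plot.

```python
import numpy as np

# Toy yearly counts following an exact quadratic, so the fit is easy to verify.
years = np.arange(1950, 2021, 10)
x = years - years.mean()  # centre the years for numerical stability
countries = 0.002 * x**2 + 0.1 * x + 5

# Degree-2 least-squares fit; coefficients are returned highest power first.
a, b, c = np.polyfit(x, countries, deg=2)
fitted = np.polyval([a, b, c], x)
print(a, b, c)
```

A positive leading coefficient `a` means the curve bends upwards, which is the visual cue the fitted line is meant to provide.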

2. How has the popularity of the sport changed over time?¶

Where is the game most viewed?¶
In [69]:
fig = go.Figure(data=go.Choropleth(
    locations = geo['Alpha-3 code'],
    z = np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"])/np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"]).max(),
    text = geo["Country or Area"],
    colorscale="pinkyl",
    autocolorscale=True,
    reversescale=True,
    #marker_line_color='viridis',
    marker_line_width=0.5,
    colorbar_title = 'Normalized Viewership on logarithmic scale',
))
fig.show()

We see that the U.S. is the top viewer of the sport (probably because the NBA originated there). There is also substantial viewership in China. According to Chinese users on Quora, there are 2 main reasons for this:

  1. Yao Ming, who was once the best center in the NBA, came from Shanghai, China. He is one of the best athletes China has ever had, and no Chinese soccer player has matched his achievements so far. He is a massive inspiration to many Chinese kids around the nation.
  2. A really famous Japanese manga called Slam Dunk influenced a lot of kids to play basketball. It especially appealed to the younger population of the country, who later grew up playing and watching the sport a lot.
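
The colour values in the map above come from a log transform followed by division by the maximum. The following is a small sketch of that scaling, using made-up interest scores rather than the `geo` data. `np.log1p(x)` is mathematically the same as the `np.log(1+x)` used in the cell.

```python
import numpy as np

# Hypothetical search-interest scores per country (illustrative values only).
interest = np.array([100.0, 40.0, 5.0, 1.0, 0.0])

# log(1 + x) compresses large values and keeps zero mapped to zero.
logged = np.log1p(interest)

# Divide by the maximum so the most interested country maps to 1.
normalized = logged / logged.max()

print(np.round(normalized, 3))
```

The log scale keeps one or two dominant countries from washing out the colour differences among the rest.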

How has the game's popularity changed over time?¶

The popularity of the sport varies within each year as well as across the years. Looking at the popularity within each year, we see the following:

In [70]:
# Load one weekly Google Trends series per year: file 1 holds 2006, file 17 holds 2022.
def loadYear(i):
    weeks = pd.read_csv("multiTimeline_"+str(i)+".csv").reset_index().loc[1:,"Category: Sports"]
    return pd.DataFrame({str(2006+i-1): pd.to_numeric(weeks)})

allWeeks = pd.concat([loadYear(i) for i in range(1,18)], axis=1)
# Heatmap with one row per year and one column per week.
px.imshow(allWeeks.T,labels=dict(x="Week", y="Year", color="Normalized popularity"),aspect="auto")

We see that, across the years, popularity peaks around the 11-week mark. This coincides with the NBA All-Star events that happen every year around this time. Note that each year is normalized separately, so to get the true picture of popularity across the years, we plot the overall timeline of the Google Trends data.
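
The following sketch shows how the peak week per year can be read off a table shaped like `allWeeks`, and why per-year normalization hides differences between years. The numbers are invented for illustration, and the names `weekly`, `peak_week` and `per_year_normalized` are hypothetical.

```python
import pandas as pd

# Hypothetical weekly interest, one column per year (illustrative values only).
weekly = pd.DataFrame(
    {
        "2006": [20, 35, 100, 60, 30],
        "2007": [10, 25, 50, 40, 15],
    },
    index=pd.RangeIndex(start=1, stop=6, name="Week"),
)

# Week with the highest interest in each year.
peak_week = weekly.idxmax()

# Rescale each year so that its own maximum becomes 100.
per_year_normalized = weekly / weekly.max() * 100

print(peak_week)
print(per_year_normalized)
```

After rescaling, both years peak at 100 even though the raw 2007 peak is half the 2006 peak. This is why comparing years requires a single timeline.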

In [71]:
# Create figure
fig = go.Figure()
overallTime = pd.DataFrame([pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Category: All categories"],pd.to_numeric(pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Unnamed: 1"])]).T

fig.add_trace(
    go.Scatter(y=overallTime["Unnamed: 1"].iloc[:-33], x=overallTime["Category: All categories"].iloc[:-33], name='Interest before covid'))
fig.add_trace(
    go.Scatter(y=overallTime["Unnamed: 1"].iloc[-34:], x=overallTime["Category: All categories"].iloc[-34:], name='Interest after covid'))

# Sets title and axis labels
fig.update_layout(
    title_text="Google Trends interest in basketball over time",
    yaxis_title='Search interest',
    xaxis_title='Date'
)

# Adds a range slider
fig.update_layout( 
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

fig.show()

We see the spikes in popularity explained earlier, but we also see that the popularity of basketball has been increasing over time. With the onset of the Covid-19 pandemic (highlighted in red), there is a sharp decrease in popularity. Despite the virus, the sport's popularity then seems to continue growing rapidly.
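
One way to put numbers on the drop described above is to split the series at a cutoff date and look at the month-over-month change. This is only a sketch: the values are invented, the cutoff of March 2020 is an assumption, and the notebook itself splits the series by position with `iloc` rather than by date.

```python
import pandas as pd

# Hypothetical monthly interest values (illustrative only).
interest = pd.Series(
    [50, 55, 60, 65, 30, 35, 70, 75],
    index=pd.date_range("2019-11-01", periods=8, freq="MS"),
)

# Assumed cutoff for the start of the pandemic period.
cutoff = pd.Timestamp("2020-03-01")

before = interest[interest.index < cutoff]
after = interest[interest.index >= cutoff]

# Month-over-month percentage change highlights the sudden drop.
pct_change = interest.pct_change() * 100
largest_drop = pct_change.idxmin()

print(before.mean(), after.mean(), largest_drop.date())
```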

3. How have players' salaries been affected?¶

What are the major contributing factors to a player's salary?¶
In [72]:
sns.set(rc={'figure.figsize':(41.7,8.27)})
#sns.stripplot(x = salaryData["Pos"],y=salaryData["Salary"],jitter=0.3,alpha=0.2)
#sns.violinplot(x = salaryData["Pos"],y=salaryData["Salary"],inner="quartile")

fig = go.Figure()
for i in salaryData["Pos"].unique():
    fig.add_trace(go.Violin( y=salaryData["Salary"][salaryData["Pos"] == i], x=salaryData["Pos"][salaryData["Pos"] == i],name=i,
                            box_visible=True,
                            meanline_visible=True))
#fig.update_traces(meanline_visible=True)
#fig.update_layout(violingap=0, violinmode='overlay')
#fig.update_layout(
#    autosize=False,
#    width=2000,
#    height=800,)

fig.show()
In [73]:
salaryData.groupby("Pos")["Salary"].median().sort_values().plot(kind="bar",ylabel="Median Salary")
Out[73]:
<AxesSubplot:xlabel='Pos', ylabel='Median Salary'>

We can see that the position a player plays greatly affects their salary. Players who take on the PG and SG roles tend to have less than 20% of the salary of players who take on the PG, SG and PF positions.

How have salaries changed over the years?¶
In [74]:
sns.set(rc={'figure.figsize':(15.7,8.27)})
# Inflation-adjusted median salary per season, computed once and reused.
medianSalary = salaryData.groupby("Season")["Salary"].median()
adjustedSalary = medianSalary * inflation_data
salyear = sns.regplot(x = medianSalary.index[:], y = adjustedSalary[:-1], color='b')
# Draw the final season's point separately in red.
salyear = sns.regplot(x = medianSalary.index[-1:], y = adjustedSalary[-2:-1], color='r')

salyear.set(xlabel = "Year")
Out[74]:
[Text(0.5, 0, 'Year')]

We see that, even after accounting for inflation, the median salary of NBA players has increased over time. This could be due to the sport's growing popularity: with an expanding global market (especially in countries like China and Brazil), the NBA can afford to pay its players more.
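
For readers unfamiliar with inflation adjustment, the following sketch shows the usual idea: scale each nominal amount by the ratio of a base-year price index to that year's price index. The index values and salaries are made up, and `to_real_dollars` is a hypothetical helper. How the notebook's `inflation_data` was built is not shown in this section, so this may differ from it.

```python
# Hypothetical consumer price index values by year (illustrative only).
cpi = {1990: 130.0, 2000: 170.0, 2020: 260.0}

# Hypothetical nominal median salaries in dollars (illustrative only).
nominal_salary = {1990: 500_000, 2000: 1_700_000, 2020: 2_600_000}

BASE_YEAR = 2020


def to_real_dollars(amount, year, base_year=BASE_YEAR):
    """Convert a nominal amount from `year` into `base_year` dollars."""
    return amount * cpi[base_year] / cpi[year]


real_salary = {year: to_real_dollars(pay, year) for year, pay in nominal_salary.items()}
print(real_salary)
```

In this toy data the nominal salary grows by a factor of 5.2 between 1990 and 2020, but only by a factor of 2.6 once both figures are expressed in 2020 dollars.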

In [75]:
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
g = sns.FacetGrid(salaryData_filtered, row="Season", aspect=20, height=0.8)
g.map_dataframe(sns.kdeplot, x="Salary",fill=True, alpha=1)
g.map_dataframe(sns.kdeplot, x="Salary", color='black')
g.fig.subplots_adjust(hspace=-.9)
g.set_titles("")
g.set_ylabels("Density")
#g.xaxis.get_label()
g.set(yticks=[])
g.despine(left=True)
Out[75]:
<seaborn.axisgrid.FacetGrid at 0x17265a3e1f0>

We also see that, over time, the range of players' pay increased dramatically. This is due to a select few NBA superstars like Kevin Durant, Kobe Bryant, and Yao Ming being paid very large amounts, while the rest are paid normally. This widens the gap between players' pay and could lead to inequality among players.
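
The inequality mentioned above could be quantified with a Gini coefficient, which is 0 when everyone earns the same and approaches 1 when one player earns almost everything. This measure is not used in the notebook. The sketch below is one possible way to compute it, with invented salaries.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means everyone is paid the same."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical salaries in millions of dollars (illustrative only).
equal_pay = [5, 5, 5, 5]
superstar_pay = [2, 2, 2, 30]

print(gini(equal_pay), gini(superstar_pay))
```

Applied per season, a rising value would support the claim that a few very large contracts are stretching the distribution.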

4. What is the evolution of shooting strategy within the court?¶

During the 1990s, the NBA pushed the 3 point line back, as can be seen in the picture below.

image-2.png

The impact of this was a sudden jump in the number of 3 point shot attempts and a sudden decrease in 2 point attempts. Since players can be a lot less accurate with 3 point shots and still score highly, teams took great advantage of this.
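
The reasoning above is an expected-value argument: points per attempt equal the shot's value times the probability of making it. The following sketch works through the arithmetic with assumed shooting percentages, which are illustrative and not taken from the data.

```python
def expected_points(points_per_make, make_probability):
    """Average points scored per attempt for a given shot type."""
    return points_per_make * make_probability


# Assumed shooting percentages, chosen only to illustrate the arithmetic.
two_point_pct = 0.50
three_point_pct = 0.35

ev_two = expected_points(2, two_point_pct)
ev_three = expected_points(3, three_point_pct)

# Accuracy a 3 point shooter needs to match the 2 point expected value.
break_even_three_pct = ev_two / 3

print(ev_two, ev_three, break_even_three_pct)
```

Under these assumptions a 3 point shooter only needs to make about a third of their shots to match a 50% 2 point shooter.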

In [76]:
# Mean 3 point attempts per season: red up to 1993, blue from 1993 onward.
threePA = salaryData_filtered.groupby("Season")["3PA"].mean()
plt.plot(threePA.loc[:1993],color="Red")
plt.plot(threePA.loc[1993:],color="Blue")
Out[76]:
[<matplotlib.lines.Line2D at 0x17261f648b0>]
In [77]:
# Mean 2 point attempts per season: red up to 1993, blue from 1993 onward.
twoPA = salaryData_filtered.groupby("Season")["2PA"].mean()
plt.plot(twoPA.loc[:1993],color="Red")
plt.plot(twoPA.loc[1993:],color="Blue")
Out[77]:
[<matplotlib.lines.Line2D at 0x17261f90160>]

We see this change quite dramatically in the graphs above: the average number of 3 point attempts per player skyrocketed from nearly 2.5 to more than 10 in just a few years. On the other hand, the number of 2 point attempts shot down from nearly 30 to fewer than 20 within a few years. The red portion is before the change was made, and the blue portion is after.

In [89]:
fig, ax = plt.subplots(1,3,figsize=(15,8),dpi=200)
sns.set_theme(style="white", rc={"axes.facecolor": (1, 1, 1, 1)})

plt.suptitle("Link between the percentage of each type of shot and winning")

sns.scatterplot(ax = ax[0],data=details,x="FG3A",y="FG3_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[0].set(ylabel="Three point shot percentage", xlabel="Three point shots attempted")

sns.scatterplot(ax = ax[1],data=details,x="FGA",y="FG_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[1].set(ylabel="Field goal percentage", xlabel="Field goals attempted")

sns.scatterplot(ax = ax[2],data=details,x="FTA",y="FT_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[2].set(ylabel="Free throw percentage", xlabel="Free throws attempted")

plt.show()

We also see that teams with more three point attempts and higher three point accuracy tend to win a lot more. The same applies, though to a lesser extent, to field goals and free throws.
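
The visual impression from the scatterplots can be checked with a simple grouped summary. The sketch below reuses the column names `FG3A`, `FG3_PCT` and `VICTORY` from `details`, but the rows are invented. It also assumes `VICTORY` is coded as 0 for a loss and 1 for a win, which this section does not confirm.

```python
import pandas as pd

# Hypothetical team box scores using the same column names as `details`.
games = pd.DataFrame(
    {
        "FG3A": [20, 35, 28, 40, 22, 38],
        "FG3_PCT": [0.30, 0.40, 0.33, 0.42, 0.31, 0.39],
        "VICTORY": [0, 1, 0, 1, 0, 1],
    }
)

# Average attempts and accuracy for losses (0) and wins (1).
by_outcome = games.groupby("VICTORY")[["FG3A", "FG3_PCT"]].mean()

# Label each game as high or low volume relative to the median attempts.
is_high = games["FG3A"] > games["FG3A"].median()
volume = is_high.map({True: "high", False: "low"}).rename("volume")

# Win rate within each volume group.
win_rate = games.groupby(volume)["VICTORY"].mean()

print(by_outcome)
print(win_rate)
```

On real data the gap between the two groups would be much smaller than in this toy example, and a difference in means alone does not show that shooting more threes causes wins.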

Recommendations or Further Works¶

From this analysis, it is apparent that, to improve its ratings, the NBA should recruit more international players to draw in a larger audience. The NBA could consider drafting more players from places like India or Africa, which have large young populations that could be inspired by these players and tune in more.

Teams should also work more on improving players' 3 point shooting. Training directed at increasing both the number of attempts and the accuracy of those shots could lead to more victories.

In the future, analyzing how popular players are on social media would be really interesting. It is often said that players with a larger social media following tend to be paid more, so analyzing performance against popularity could be worthwhile.

Another interesting thing to analyze would be how different teams treat/trade their players. Analyzing the different coaching techniques and their impacts on the overall game could provide really interesting and useful insights on the dynamics of the NBA, and its impact.

References¶

  1. https://www.youtube.com/watch?v=GHpFjtuAYLQ
  2. https://apanalytics.shinyapps.io/knarsu3/
  3. https://www.basketball-reference.com/players/a/allenra02.html#all_all_salaries
  4. https://www.kaggle.com/code/xuannaselli/it-s-raining-threes-in-nba-is-it-worth-it#Machine-Learning-model
  5. https://en.wikipedia.org/wiki/Basketball_positions
  6. https://www.rookieroad.com/basketball/shot-types/field-goals/
  7. https://www.youtube.com/watch?v=2p3NIR8LYoo